Welcome

This is the website for a book that serves as the official documentation for OmicSHIELD. Here you will find: introductory references to learn about DataSHIELD and “resources,” an explanation of the types of analysis that can be performed using OmicSHIELD, and workflows (with reproducible code) for the main functionalities of OmicSHIELD.

All the functionalities described in this book have been developed at the Bioinformatic Research Group in Epidemiology (BRGE) of ISGlobal, with external help from Yannick Marcon (Obiba), and are part of OmicSHIELD.

The aim of this book is to showcase the capabilities of the software we developed, as well as to serve as a quick reference guide for new researchers interested in this technology.

This material also serves as an online companion for the accompanying manuscript (title to be announced).

This website is free to use and is licensed under an MIT license.

I Preamble

1 Introduction

1.1 Materials to read beforehand

Throughout this book, some details regarding DataSHIELD and “resources” are not explained in depth; the reader is expected to be familiar with them. If that is not the case, the following free online books and papers cover that material.

  • DataSHIELD paper: A description of what DataSHIELD is.

  • DataSHIELD wiki: Materials about DataSHIELD including:

    • Beginner material
    • Recorded DataSHIELD workshops
    • Information on the current release of DataSHIELD
  • resource book: In this book you will find information about:

    • DataSHIELD (Section 5)
    • What are resources (Section 6/7)

1.2 What are “resources”: A very simple explanation without any technicalities

It is quite important to have a solid understanding of what the “resources” are and how we work with them, since in all the use cases we interact with them to load the omic data into the R sessions. For that reason, we include a very brief, non-technical description of them.

The “resources” can be imagined as a data structure that contains the information about where to find a data set and the credentials to access it. As DataSHIELD users we are not able to look at this information (it is privately stored on the Opal server), but we can load it into our remote R session to make use of it. From there, the next step comes naturally.

Once we have in an R session the information to access a dataset (an ExpressionSet, for example), we have to actually retrieve it in the remote R session to analyze it. This step is called resolving the resource.

Those two steps can be identified in the code we provide as follows:

Loading the information of a “resource”:

DSI::datashield.assign.resource(conns, "resource", "resource.path.in.opal.server")

Resolving the “resource”:

DSI::datashield.assign.expr(conns, "resource.resolved", expr = as.symbol("as.resource.object(resource)"))

This toy code first loads the “resource” into a variable called resource, and then retrieves the information it contains, assigning it to a variable called resource.resolved.

1.3 Capabilities of OmicSHIELD

The functionalities of OmicSHIELD are built on top of the “resources” to work with different types of data objects. More precisely, we have developed capabilities to work with the following R objects:

  • ExpressionSet
  • RangedSummarizedExperiment
  • VCF/GDS (Genotype data containers)

These objects are analyzed using Bioconductor packages as well as custom-made functions. This ensures that researchers familiar with the Bioconductor universe will feel at home when using OmicSHIELD.

Not only can we work using a Bioconductor approach, we have also developed functionalities that make use of command-line tools traditionally used in omics analysis, namely:

  • PLINK
  • SNPTEST

This allows researchers to perform analyses on federated data using their own command-line-based pipelines. Again, this ensures that people familiar with those tools will be able to perform analyses easily. A hypothetical sketch of this workflow is shown below.
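As an illustration, a PLINK analysis on server-side data could look like the following sketch. It assumes an existing DataSHIELD connection object (conns) and an ssh resource on the Opal server that exposes the PLINK binary and the data files; the resource path, file names and PLINK arguments below are purely illustrative (see ?ds.PLINK for the actual details).

# Assign a hypothetical ssh resource that gives access to PLINK on the server
DSI::datashield.assign.resource(conns, "client", "PROJECT.brge_plink")

# PLINK arguments are passed as a single string, exactly as on the command line
plink.arguments <- "--bfile brge --logistic --pheno brge.phe --pheno-name obese"
ans.plink <- ds.PLINK("client", plink.arguments)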

2 Omic data analysis: types of implemented analyses

Figure 2.1 describes the different types of ’omic association analyses that can be performed using DataSHIELD client functions implemented in the dsOmicsClient package. Basically, data (’omic and phenotypes/covariates) can be stored in different sites (http, ssh, AWS S3, local, …) and are managed with Opal through the resourcer package and its extensions implemented in dsOmics.

Figure 2.1: Non-disclosive omic data analysis with DataSHIELD and Bioconductor. The figure illustrates how the resourcer package is used to get access to omic data through the Opal servers. Then DataSHIELD is used in the client side to perform non-disclosive data analyses.

Then, the dsOmicsClient package allows different types of analyses: pooled and meta-analysis. Both methods are based on fitting a different generalized linear model (GLM) for each feature when assessing association between ’omic data and the phenotype/trait/condition of interest. Of course, non-disclosive ’omic data analysis from a single study can also be performed.

The pooled approach (Figure 2.2) is recommended when the user wants to analyze ’omic data from different sources and obtain results as if the data were located in a single computer. It should be noted that this can be very time-consuming when analyzing multiple features, since it calls a base DataSHIELD function (ds.glm) repeatedly. It is also not recommended when data are not properly harmonized (e.g., gene expression normalized using different methods, GWAS data coming from different platforms, …). Furthermore, when it is necessary to remove unwanted variability (for transcriptomic and epigenomic analyses) or to control for population stratification (for GWAS analyses), this approach cannot be used, since methods to compute surrogate variables (to remove unwanted variability) or principal components (to address population stratification) in a non-disclosive way would first need to be developed.
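To make the pooled idea concrete, a single pooled logistic regression with ds.glm could be sketched as follows (assuming an existing connection object conns and a phenotype table pheno already assigned on the servers; the variable names are illustrative):

library(dsBaseClient)

# One pooled GLM across all servers: only aggregate statistics travel
# between the servers and the client at each iteration
mod <- ds.glm(formula = "condition ~ smoke + sex",
              data = "pheno", family = "binomial",
              datasources = conns)
mod$coefficients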

The meta-analysis approach (Figure 2.3) overcomes the limitations raised when performing pooled analyses. First, the computational issue is addressed by using scalable and fast methods to perform data analysis at the whole-genome level at each location. The transcriptomic and epigenomic data analyses make use of the widely used limma package, which uses the ExpressionSet or RangedSummarizedExperiment Bioconductor infrastructures to deal with ’omic and phenotypic data (e.g., covariates). The genomic data are analyzed using GWASTools and GENESIS, which are designed to perform quality control (QC) and GWAS using the GDS infrastructure.

Next, we describe how both approaches are implemented:

Figure 2.2: Non-disclosive omic data analysis with DataSHIELD and Bioconductor. The figure illustrates how to perform single pooled omic data analysis. The analyses are performed by using a generalized linear model (glm) on data from one or multiple sources. It makes use of ds.glm(), a DataSHIELD function, that uses an approach that is mathematically equivalent to placing all individual-level data from all sources in one central warehouse and analysing those data using the conventional glm() function in R.

Figure 2.3: Non-disclosive omic data analysis with DataSHIELD and Bioconductor. The figure illustrates how to perform analyses at genome-wide level from one or multiple sources. It runs standard Bioconductor functions at each server independently to speed up the analyses and, in the case of having multiple sources, results can be meta-analyzed using standard R functions.

II Workflows

3 Genome-wide association study (GWAS) analysis

Three different use cases will be illustrated.

For all the illustrated cases we will use data divided by chromosome (VCF files). The procedure is the same if all the variants are in a single VCF.

The data used throughout this section is a synthetic data set generated by the CINECA project, made freely available under the Creative Commons Licence (CC-BY) (funding: EC H2020 grant 825775). It has been pruned, and the individuals have been separated into three synthetic cohorts. For the “Single cohort” use case, all the individuals are used. Information on the individual and SNP counts of each cohort can be found in Table 3.1.

Table 3.1: Number of SNPs and individuals by cohort

                        Cohort 1   Cohort 2   Cohort 3     Total
Number of SNPs           865,240    865,240    865,240   865,240
Number of individuals        817      1,073        614     2,504

3.1 Single cohort

The single-cohort analysis is a way of performing a GWAS while guaranteeing GDPR data confidentiality. The structure followed is illustrated in the following figure.

Figure 3.1: Proposed infrastructure to perform single-cohort GWAS studies.

The data analyst corresponds to the “RStudio” session, which connects through the DataSHIELD Interface (DSI) with the Opal server located at the cohort network. This Opal server contains an array of resources that correspond to the different VCF files (sliced by chromosome) and a resource that corresponds to the phenotypes table of the studied individuals.

3.1.1 Connection to the Opal server

We have to create an Opal connection object to the cohort server. We do that using the following functions.

require('DSI')
require('DSOpal')
require('dsBaseClient')
require('dsOmicsClient')

builder <- DSI::newDSLoginBuilder()
builder$append(server = "cohort1", url = "https://opal-demo.obiba.org/",
               user =  "dsuser", password = "P@ssw0rd",
               driver = "OpalDriver", profile = "omics")
logindata <- builder$build()
conns <- DSI::datashield.login(logins = logindata)

3.1.2 Assign the VCF resources

Now that we have created a connection object to the Opal, we have started a new R session on the server. Our analysis will take place in this remote session, so we have to load the data into it.

In this use case we will use 21 different resources from the GWAS project hosted on the demo Opal server. These resources correspond to VCF files with information on individual chromosomes. The names of the resources are chrXX (where XX is the chromosome number). Following the Opal syntax, we will refer to them using the string GWAS.chrXX.

To load the resources we will use the DSI::datashield.assign.resource() function. Note that throughout the use case we use the lapply function to simplify our code, since we have to perform repetitive operations. lapply is a way of creating loops in R; for more information, visit the documentation.

lapply(1:21, function(x){
  DSI::datashield.assign.resource(conns, paste0("chr", x), paste0("GWAS.chr", x))
  })

Now we have assigned all the resources named GWAS.chrXX into our remote R session, under variables called chrXX. To verify this step has been performed correctly, we can use the ds.class function to check their class and that they exist on the remote session.

ds.class("chr1")
$cohort1
[1] "GDSFileResourceClient" "FileResourceClient"    "ResourceClient"       
[4] "R6"                   

To resolve the resources and retrieve the data in the remote session we will use the DSI::datashield.assign.expr() function. This function runs a line of code on the remote session, and in particular we want to run the function as.resource.object(), which is the DataSHIELD function in charge of resolving the resources.

lapply(1:21, function(x){
  DSI::datashield.assign.expr(conns = conns, symbol = paste0("gds", x, "_object"),
                            expr = as.symbol(paste0("as.resource.object(chr", x, ")")))
})

Now we have resolved the resources named chrXX into our remote R session. The retrieved objects have been assigned to variables named gdsXX_object. We can check that the process was successful as we did before.

ds.class("gds1_object")
$cohort1
[1] "GdsGenotypeReader"
attr(,"package")
[1] "GWASTools"

3.1.3 Assign the phenotypes

The objects we have loaded into our remote session are VCF files that contain genomic information on the individuals. To perform a GWAS, this information has to be related to some phenotypes. Therefore, we need to load the phenotypes into the remote session. The phenotypes information is a table with individuals as rows and phenotypes as columns. In this use case, we will use a resource (as with the VCF files) to load the phenotypes table into the remote session.

The procedure is practically the same as before, with some minor tweaks. To begin, we have the phenotypes information in a single table (hence, a single resource). Then, instead of the function as.resource.object, we will use as.resource.data.frame; this is because before we were loading a special object (a VCF file) and now we are loading a plain table, so the internal treatment on the remote session has to be different.

DSI::datashield.assign.resource(conns, "pheno", "GWAS.ega_phenotypes")
DSI::datashield.assign.expr(conns = conns, symbol = "pheno_object",
                            expr = quote(as.resource.data.frame(pheno)))

As before, we can check that we have assigned the phenotypes table to a variable called pheno_object on the remote R session.

ds.class("pheno_object")
$cohort1
[1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame" 

We can also check the column names to see which information is present in the table.

ds.colnames("pheno_object")[[1]]
 [1] "age_cancer_diagnosis"                "age_recruitment"                    
 [3] "alcohol_day"                         "basophill_count"                    
 [5] "behaviour_tumour"                    "bmi"                                
 [7] "c_reactive_protein_reportability"    "record_origin"                      
 [9] "report_format"                       "cholesterol"                        
[11] "birth_country"                       "actual_tobacco_smoker"              
[13] "date_diagnosis"                      "diabetes_diagnosed_doctor"          
[15] "diastolic_blood_pressure"            "duration_moderate_activity"         
[17] "duration_vigorous_activity"          "energy"                             
[19] "eosinophill_count"                   "ethnicity"                          
[21] "ever_smoked"                         "fasting_time"                       
[23] "alcohol_frequency"                   "hdl_cholesterol"                    
[25] "hip_circumference"                   "histology_tumour"                   
[27] "home_e_coordinate"                   "home_n_coordinate"                  
[29] "e_house_score"                       "s_house_score"                      
[31] "w_house_score"                       "e_income_score"                     
[33] "w_income_score"                      "ldl_direct"                         
[35] "lymphocyte_count"                    "monocyte_count"                     
[37] "neutrophill_count"                   "n_accidental_death_close_gen_family"
[39] "household_count"                     "children_count"                     
[41] "operations_count"                    "operative_procedures"               
[43] "past_tobacco"                        "platelet_count"                     
[45] "pulse_rate"                          "qualification"                      
[47] "cancer_occurrences"                  "reticulocuyte_count"                
[49] "sleep_duration"                      "height"                             
[51] "subject_id"                          "triglycerides"                      
[53] "cancer_type_idc10"                   "cancer_type_idc9"                   
[55] "weight"                              "leukocyte_count"                    
[57] "sex"                                 "phenotype"                          
[59] "age_death"                           "age_diagnosis_diabetes"             
[61] "date_k85_report"                     "date_k86_report"                    
[63] "date_death"                          "enm_diseases"                       
[65] "source_k85_report"                   "source_k86_report"                  
[67] "age_hbp_diagnosis"                   "mental_disorders"                   
[69] "respiratory_disorders"               "cancer_year"                        
[71] "circulatory_disorders"               "digestive_disorders"                
[73] "nervous_disorders"                   "immune_disorders"                   

3.1.4 Merge VCF (genotype) and phenotype information

At this point, we have 21 VCF objects (each corresponding to a chromosome) and a phenotypes table in our remote session. The next step is merging each of the VCF objects with the phenotypes table. Before doing that, however, we have to gather some information from the phenotypes table. The information to gather is summarized in Table 3.2.

Table 3.2: Information to gather from the phenotypes table.

  • Which column has the sample identifiers? — The name of the column that contains the IDs of the samples. These IDs will be matched to the ones present on the VCF objects.
  • Which is the sex column on the covariates file? — The name of the column that contains the sex information.
  • How are males and females encoded in this column? — There is no standard way to encode sex information (some people use 0/1, male/female, etc.). Our approach uses a library that requires a specific encoding, so we need to know the original encoding to perform the translation.
  • Which is the case-control column? — The name of the column that contains the case/control status to be studied.
  • What is the encoding for case and for control in this column? — As with the sex column, the case/control status will be translated to the particular standard of the software we used to develop our functionalities.

If we are not sure about the exact column names, we can use the function ds.colnames as we did before. We can also use the function ds.table1D to check the levels of the categorical variables.

ds.table1D("pheno_object$sex")$counts
       pheno_object$sex
female             1271
male               1233
Total              2504
ds.table1D("pheno_object$diabetes_diagnosed_doctor")$counts
                     pheno_object$diabetes_diagnosed_doctor
Do_not_know                                             612
No                                                      632
Prefer_not_to_answer                                    590
Yes                                                     668
Total                                                  2502

From the different functions we used, we can extract our answers (summarized in Table 3.3).

Table 3.3: Summary of options from the example phenotypes table.

Question                                                          Answer
Which column has the sample identifiers?                          “subject_id”
Which is the sex column on the covariates file?                   “sex”
How are males and females encoded in this column?                 “male”/“female”
Which is the case-control column of interest of the covariates?   “diabetes_diagnosed_doctor”
What is the encoding for case and for control in this column?     “Yes”/“No”

With all this information we can now merge the phenotypes and VCF objects into a type of object named GenotypeData. We will use the ds.GenotypeData function.

lapply(1:21, function(x){
  ds.GenotypeData(x=paste0('gds', x,'_object'), covars = 'pheno_object', columnId = "subject_id",
                  sexId = "sex", male_encoding = "male", female_encoding = "female",
                  case_control_column = "diabetes_diagnosed_doctor", case = "Yes", control = "No",
                  newobj.name = paste0('gds.Data', x), datasources = conns)
})

The objects we just created are named gds.DataXX on the remote session. Now we are ready to perform a GWAS analysis. To perform a GWAS, we have to supply these GenotypeData objects and a formula specifying the type of association we are interested in studying. The R language has its own way of writing formulas; a simple example would be the following (relate the condition to the smoke variable, adjusted by sex):

condition ~ smoke + sex

For this use case we will use the following formula.

diabetes_diagnosed_doctor ~ sex + hdl_cholesterol

3.1.5 Performing the GWAS

We are finally ready to achieve our ultimate goal: performing a GWAS analysis. We have our data (genotype + phenotypes) organized into the corresponding objects (GenotypeData) and our association formula ready. The function ds.metaGWAS is in charge of the analysis; we have to supply it with the names of the GenotypeData objects on the remote R session and the formula. Since we have 21 different objects, the paste0 function is used to simplify the call.

results_single <- ds.metaGWAS(genoData = paste0("gds.Data", 1:21), model = diabetes_diagnosed_doctor ~ sex + hdl_cholesterol)[[1]]

results_single
# A tibble: 865,240 x 10
   rs        chr        pos n.obs  freq   p.value    Est Est.SE reference_allele
 * <chr>     <chr>    <int> <dbl> <dbl>     <dbl>  <dbl>  <dbl> <chr>           
 1 rs4648446 1       2.89e6  1292 0.512   7.42e-6 -0.323 0.0721 C               
 2 rs4659014 1       1.20e8  1296 0.587   8.74e-6  0.321 0.0723 T               
 3 rs127465~ 1       2.90e6  1294 0.548   9.69e-6 -0.321 0.0725 C               
 4 rs313728  1       8.62e7  1293 0.277   1.08e-5 -0.375 0.0852 C               
 5 rs113272~ 1       1.17e8  1292 0.859   1.08e-5 -0.459 0.104  T               
 6 rs173625~ 1       2.82e7  1293 0.864   2.26e-5 -0.466 0.110  C               
 7 rs292001  1       2.30e7  1293 0.343   2.64e-5  0.327 0.0779 G               
 8 rs741198~ 1       1.47e8  1294 0.932   3.16e-5  0.608 0.146  A               
 9 rs580006~ 1       5.30e7  1293 0.929   3.69e-5  0.590 0.143  C               
10 rs729056~ 1       5.30e7  1294 0.929   4.18e-5  0.586 0.143  G               
# ... with 865,230 more rows, and 1 more variable: alternate_allele <chr>

We can display the results of the GWAS using a Manhattan plot.

manhattan(results_single)
Figure: Manhattan plot of the single-cohort results.

Moreover, we can visualize the region with the lowest p-value in detail using the function LocusZoom. There are multiple arguments to control the gene annotation, display a different region of the GWAS, display a different range of positions and more. Make sure to check ?LocusZoom for all the details.

LocusZoom(results_single, use_biomaRt = FALSE)
Figure 3.2: LocusZoom plot of the surrounding region of the top hit SNP.

datashield.logout(conns)

3.2 Multi cohorts

Now all the individuals used for the single-cohort use case will be divided into three different synthetic cohorts. The results from each cohort will then be meta-analyzed and compared to the results obtained by studying all the individuals together. As with the single-cohort methodology illustrated in the prior section, this method guarantees GDPR data confidentiality. The structure for a three-cohort study is illustrated in the following figure (note this can be extrapolated to studies with a larger (or smaller) number of partners).

Figure 3.3: Proposed infrastructure to perform multi-cohort GWAS studies.

The data analyst corresponds to the “RStudio” session, which connects through the DataSHIELD Interface (DSI) with the Opal servers located at the different cohort networks. Each Opal server contains an array of resources that correspond to the different VCF files (sliced by chromosome) and a resource that corresponds to the phenotypes table of the individuals studied at that cohort.

This use case will portray a complete study methodology; more precisely, we will follow this flowchart:

Figure 3.4: Flowchart to perform multi-cohort GWAS studies. Adapted from Winkler et al. (2014)

Each step is described in the following subsections.

3.2.1 Connection to the Opal server

We have to create an Opal connection object to each cohort's server. We do that using the following functions.

require('DSI')
require('DSOpal')
require('dsBaseClient')
require('dsOmicsClient')

builder <- DSI::newDSLoginBuilder()
builder$append(server = "cohort1", url = "https://opal-demo.obiba.org/",
               user =  "dsuser", password = "P@ssw0rd",
               driver = "OpalDriver", profile = "omics")
builder$append(server = "cohort2", url = "https://opal-demo.obiba.org/",
               user =  "dsuser", password = "P@ssw0rd",
               driver = "OpalDriver", profile = "omics")
builder$append(server = "cohort3", url = "https://opal-demo.obiba.org/",
               user =  "dsuser", password = "P@ssw0rd",
               driver = "OpalDriver", profile = "omics")
logindata <- builder$build()
conns <- DSI::datashield.login(logins = logindata)

It is important to note that in this use case we are using a single server (https://opal-demo.obiba.org/) that holds the resources corresponding to all the different cohorts. In a more realistic scenario, each of the builder$append instructions would connect to a different server, as sketched below.
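For illustration, a realistic login block would look like the following, with one URL per cohort (the URLs below are hypothetical):

builder <- DSI::newDSLoginBuilder()
builder$append(server = "cohort1", url = "https://opal.cohort1-institute.org/", # hypothetical URL
               user = "dsuser", password = "P@ssw0rd",
               driver = "OpalDriver", profile = "omics")
builder$append(server = "cohort2", url = "https://opal.cohort2-institute.org/", # hypothetical URL
               user = "dsuser", password = "P@ssw0rd",
               driver = "OpalDriver", profile = "omics")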

3.2.2 Assign the VCF resources

Now that we have created a connection object to the different Opals, we have started three R sessions. Our analysis will take place in those remote sessions, so we have to load the data into them.

In this use case we will use 63 different resources from the GWAS project hosted on the demo Opal server. These resources correspond to VCF files with information on individual chromosomes (21 chromosomes for each of the three cohorts). The names of the resources are chrXXY (where XX is the chromosome number and Y is the cohort: A/B/C). Following the Opal syntax, we will refer to them using the string GWAS.chrXXY.

We have to refer specifically to each server by using conns[X]; this allows us to communicate with the server of interest to indicate which resources it has to load.

# Cohort 1 resources
lapply(1:21, function(x){
  DSI::datashield.assign.resource(conns[1], paste0("chr", x), paste0("GWAS.chr", x,"A"))
  })

# Cohort 2 resources
lapply(1:21, function(x){
  DSI::datashield.assign.resource(conns[2], paste0("chr", x), paste0("GWAS.chr", x,"B"))
  })

# Cohort 3 resources
lapply(1:21, function(x){
  DSI::datashield.assign.resource(conns[3], paste0("chr", x), paste0("GWAS.chr", x,"C"))
  })

Now we have assigned all the resources named GWAS.chrXXY into our remote R sessions, under variables called chrXX. To verify this step has been performed correctly, we can use the ds.class function to check their class and that they exist on the remote sessions.

ds.class("chr1")
$cohort1
[1] "GDSFileResourceClient" "FileResourceClient"    "ResourceClient"       
[4] "R6"                   

$cohort2
[1] "GDSFileResourceClient" "FileResourceClient"    "ResourceClient"       
[4] "R6"                   

$cohort3
[1] "GDSFileResourceClient" "FileResourceClient"    "ResourceClient"       
[4] "R6"                   

We can see that the object chr1 exists on all three servers.

Finally the resources are resolved to retrieve the data in the remote sessions.

lapply(1:21, function(x){
  DSI::datashield.assign.expr(conns = conns, symbol = paste0("gds", x, "_object"),
                            expr = as.symbol(paste0("as.resource.object(chr", x, ")")))
})

Now we have resolved the resources named chrXX into our remote R sessions. The retrieved objects have been assigned to variables named gdsXX_object. We can check that the process was successful as we did before.

ds.class("gds1_object")
$cohort1
[1] "GdsGenotypeReader"
attr(,"package")
[1] "GWASTools"

$cohort2
[1] "GdsGenotypeReader"
attr(,"package")
[1] "GWASTools"

$cohort3
[1] "GdsGenotypeReader"
attr(,"package")
[1] "GWASTools"

3.2.3 Assign the phenotypes

The objects we have loaded into our remote sessions are VCF files that contain genomic information on the individuals. To perform a GWAS, this information has to be related to some phenotypes. Therefore, we need to load the phenotypes into the remote sessions. The phenotypes information is a table with individuals as rows and phenotypes as columns. In this use case, we will use a resource (as with the VCF files) to load the phenotypes table into the remote sessions.

As with the VCF resources, here we load a different resource into each cohort, since each cohort is expected to hold the phenotypes information for its own individuals only, hence different tables. Note that we are assigning the resource to the same variable at each server (pheno), so we only need one function call to resolve all the resources together.

# Cohort 1 phenotypes table
DSI::datashield.assign.resource(conns[1], "pheno", "GWAS.ega_phenotypes_1")

# Cohort 2 phenotypes table
DSI::datashield.assign.resource(conns[2], "pheno", "GWAS.ega_phenotypes_2")

# Cohort 3 phenotypes table
DSI::datashield.assign.resource(conns[3], "pheno", "GWAS.ega_phenotypes_3")

# Resolve phenotypes table
DSI::datashield.assign.expr(conns = conns, symbol = "pheno_object",
                            expr = quote(as.resource.data.frame(pheno)))

As before, we can check that we have assigned the phenotypes table to a variable called pheno_object on each remote R session.

ds.class("pheno_object")
$cohort1
[1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame" 

$cohort2
[1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame" 

$cohort3
[1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame" 

We can also check the column names to see which information is present in the table.

ds.colnames("pheno_object")[[1]]
 [1] "age_cancer_diagnosis"                "age_recruitment"                    
 [3] "alcohol_day"                         "basophill_count"                    
 [5] "behaviour_tumour"                    "bmi"                                
 [7] "c_reactive_protein_reportability"    "record_origin"                      
 [9] "report_format"                       "cholesterol"                        
[11] "birth_country"                       "actual_tobacco_smoker"              
[13] "date_diagnosis"                      "diabetes_diagnosed_doctor"          
[15] "diastolic_blood_pressure"            "duration_moderate_activity"         
[17] "duration_vigorous_activity"          "energy"                             
[19] "eosinophill_count"                   "ethnicity"                          
[21] "ever_smoked"                         "fasting_time"                       
[23] "alcohol_frequency"                   "hdl_cholesterol"                    
[25] "hip_circumference"                   "histology_tumour"                   
[27] "home_e_coordinate"                   "home_n_coordinate"                  
[29] "e_house_score"                       "s_house_score"                      
[31] "w_house_score"                       "e_income_score"                     
[33] "w_income_score"                      "ldl_direct"                         
[35] "lymphocyte_count"                    "monocyte_count"                     
[37] "neutrophill_count"                   "n_accidental_death_close_gen_family"
[39] "household_count"                     "children_count"                     
[41] "operations_count"                    "operative_procedures"               
[43] "past_tobacco"                        "platelet_count"                     
[45] "pulse_rate"                          "qualification"                      
[47] "cancer_occurrences"                  "reticulocuyte_count"                
[49] "sleep_duration"                      "height"                             
[51] "subject_id"                          "triglycerides"                      
[53] "cancer_type_idc10"                   "cancer_type_idc9"                   
[55] "weight"                              "leukocyte_count"                    
[57] "sex"                                 "phenotype"                          
[59] "age_death"                           "age_diagnosis_diabetes"             
[61] "date_k85_report"                     "date_k86_report"                    
[63] "date_death"                          "enm_diseases"                       
[65] "source_k85_report"                   "source_k86_report"                  
[67] "age_hbp_diagnosis"                   "mental_disorders"                   
[69] "respiratory_disorders"               "cancer_year"                        
[71] "circulatory_disorders"               "digestive_disorders"                
[73] "nervous_disorders"                   "immune_disorders"                   

3.2.4 Merge VCF (genotype) and phenotype information

At this point, we have 21 VCF objects in each cohort's R session (each corresponding to a chromosome) and a phenotypes table. The next step is merging each of the VCF objects with the phenotypes table. The same procedure as in the single-cohort case can be applied to extract the required arguments.

With all this information we can now merge the phenotypes and VCF objects into a type of object named GenotypeData. We will use the ds.GenotypeData function.

lapply(1:21, function(x){
  ds.GenotypeData(x=paste0('gds', x,'_object'), covars = 'pheno_object', columnId = "subject_id", sexId = "sex",
                male_encoding = "male", female_encoding = "female",
                case_control_column = "diabetes_diagnosed_doctor", case = "Yes", control = "No",
                newobj.name = paste0('gds.Data', x), datasources = conns)
})

The objects we just created are named gds.DataXX on the remote sessions. Now we are ready to perform a GWAS analysis. We will use the same formula as in the single-cohort use case, so the results can be compared later:

diabetes_diagnosed_doctor ~ sex + hdl_cholesterol

3.2.5 Performing a meta-GWAS

We now proceed to perform the GWAS analysis.

results <- ds.metaGWAS(genoData = paste0("gds.Data", 1:21), model = diabetes_diagnosed_doctor ~ sex + hdl_cholesterol)

And we can see the results for each cohort.

# A tibble: 865,162 x 10
   rs        chr        pos n.obs  freq   p.value    Est Est.SE reference_allele
 * <chr>     <chr>    <int> <dbl> <dbl>     <dbl>  <dbl>  <dbl> <chr>           
 1 rs579533~ 1       3.90e7   430 0.737   3.28e-7  0.730  0.143 T               
 2 rs113272~ 1       1.17e8   428 0.876   6.44e-7 -0.936  0.188 T               
 3 rs4970551 1       3.90e7   350 0.749   6.02e-6  0.738  0.163 A               
 4 rs9726184 1       2.81e7   430 0.801   6.50e-6  0.702  0.156 C               
 5 rs2906466 1       4.43e7   428 0.551   8.79e-6 -0.563  0.127 A               
 6 rs121172~ 1       1.87e8   428 0.407   8.87e-6 -0.600  0.135 T               
 7 rs1626710 1       2.06e8   428 0.215   1.06e-5 -0.693  0.157 C               
 8 rs109026~ 1       2.81e7   430 0.8     1.10e-5  0.679  0.155 G               
 9 rs1938519 1       1.87e8   429 0.597   1.24e-5  0.587  0.134 A               
10 rs1572823 1       1.99e7   428 0.189   1.24e-5 -0.740  0.169 T               
# ... with 865,152 more rows, and 1 more variable: alternate_allele <chr>

# A tibble: 865,190 x 10
   rs        chr        pos n.obs  freq   p.value    Est Est.SE reference_allele
 * <chr>     <chr>    <int> <dbl> <dbl>     <dbl>  <dbl>  <dbl> <chr>           
 1 rs9426275 1       2.98e7   549 0.756   5.07e-6 -0.666  0.146 C               
 2 rs124057~ 1       8.57e7   549 0.934   9.05e-5  0.917  0.234 A               
 3 rs2762766 1       1.45e8   550 0.52    1.11e-4  1.18   0.305 T               
 4 rs111615~ 1       8.58e7   549 0.779   1.80e-4  0.522  0.139 T               
 5 rs4657641 1       1.67e8   552 0.343   2.54e-4 -0.416  0.114 G               
 6 rs2295623 1       1.60e8   549 0.906   2.96e-4  0.715  0.197 C               
 7 rs1891510 1       1.16e8   550 0.725   2.97e-4  0.499  0.138 T               
 8 rs776826~ 1       2.21e8   551 0.928   3.06e-4 -0.802  0.222 C               
 9 rs120926~ 1       3.13e7   551 0.819   3.68e-4 -0.546  0.153 C               
10 rs6670319 1       1.16e8   552 0.724   3.99e-4  0.488  0.138 T               
# ... with 865,180 more rows, and 1 more variable: alternate_allele <chr>

# A tibble: 849,560 x 10
   rs        chr         pos n.obs  freq  p.value    Est Est.SE reference_allele
 * <chr>     <chr>     <int> <dbl> <dbl>    <dbl>  <dbl>  <dbl> <chr>           
 1 rs4915771 1      62151742   285 0.593  2.77e-5 -0.700  0.167 G               
 2 rs7544580 1      62178523   285 0.593  2.77e-5 -0.700  0.167 T               
 3 rs1385117 1     102112168   286 0.893  3.81e-5  1.09   0.264 A               
 4 rs121437~ 1     172319486   285 0.909  4.07e-5  1.16   0.283 T               
 5 rs7521871 1     102110628   285 0.893  4.19e-5  1.08   0.264 G               
 6 rs133739~ 1     102113861   285 0.893  4.19e-5  1.08   0.264 A               
 7 rs600511~ 1     245057001   286 0.893  6.36e-5  1.08   0.271 T               
 8 rs1938514 1     187349937   285 0.760  8.05e-5 -0.746  0.189 C               
 9 rs121309~ 1     172358588   285 0.919  8.18e-5  1.16   0.295 G               
10 rs761675~ 1     172361011   285 0.919  8.18e-5  1.16   0.295 G               
# ... with 849,550 more rows, and 1 more variable: alternate_allele <chr>

3.2.6 GWAS Plots

We can display the results of the GWAS using a Manhattan plot for each cohort.

manhattan(results[[1]])
Figure 3.5: Manhattan plot of Cohort 1.

manhattan(results[[2]])
Figure 3.6: Manhattan plot of Cohort 2.

manhattan(results[[3]])
Figure 3.7: Manhattan plot of Cohort 3.

3.2.7 Meta-level quality control

Up to this point we have collected the aggregated statistics from each cohort and visualized them. Before meta-analyzing these statistics, we have to perform some quality control to make sure there are no major issues. There are three different methods bundled in dsOmicsClient to perform this task:

  • EAF Plot: Plotting the reported EAFs against a reference set, such as from the HapMap or 1000 Genomes projects, or from one specific study, can help visualize patterns that pinpoint strand issues, allele miscoding or the inclusion of individuals whose self-reported ancestry did not match their genetic ancestry.
  • P-Z Plot: Plot to reveal issues with beta estimates, standard errors and P values. The plots compare the P values reported in the association GWAS with P values calculated from Z-statistics (P.ztest) derived from the reported beta and standard error. A study with no issues should show perfect concordance. One plot is generated per study, displaying the concordance of all that study's SNPs.
  • SE-N Plot: Plot to reveal issues with trait transformations. A collection of studies with no issues will fall on a line.

If any of these QC checks reveals inconsistencies, please contact the corresponding study analyst.

3.2.7.1 EAF Plot

For this plot we compare the estimated allele frequencies to a reference set, so the first step is to obtain one. For this example we will use the ALL.wgs.phase1_release_v3.20101123 reference set. We will download it and load it into the client R session.

reference <- readr::read_tsv("https://homepages.uni-regensburg.de/~wit59712/easyqc/exomechip/ALL.wgs.phase1_release_v3.20101123.snps_indels_sv.sites.EUR_AF.txt.exm.tab.sexChr.DI.txt.GZ")
reference
# A tibble: 75,292 x 8
   ExmName    chrChr_Pos    Chr    Pos Rs          Ref   Alt     Freq
   <chr>      <chr>       <dbl>  <dbl> <chr>       <chr> <chr>  <dbl>
 1 exm2268640 chr1:762320     1 762320 rs75333668  C     T     0.0013
 2 exm47      chr1:865628     1 865628 rs41285790  G     A     0.01  
 3 exm55      chr1:865694     1 865694 rs9988179   C     T     0.0026
 4 exm67      chr1:871159     1 871159 .           G     A     0.0013
 5 exm106     chr1:874762     1 874762 rs139437968 C     T     0.0013
 6 exm158     chr1:878744     1 878744 .           G     C     0.0026
 7 exm241     chr1:880502     1 880502 rs74047418  C     T     0.0013
 8 exm2253575 chr1:881627     1 881627 rs2272757   G     A     0.63  
 9 exm269     chr1:881918     1 881918 rs35471880  G     A     0.06  
10 exm340     chr1:888659     1 888659 rs3748597   T     C     0.95  
# ... with 75,282 more rows

Now we just use the function eafPlot, passing it the GWAS results and the reference set.

eafPlot(x = results, reference = reference, rs_reference = "Rs", freq_reference = "Freq")
Figure 3.8: EAF Plot.

The obtained results are as expected: a linear correlation with slope one. This indicates the alleles are correctly encoded. For further explanation on how to assess the results produced by this plot, please refer to Figure 4 of Winkler et al. (2014).

3.2.7.2 P-Z Plot

This plot assesses issues with the beta estimates, standard errors and P values. Here we only need the results from the ds.metaGWAS function, which we pass to the function pzPlot.

pzPlot(results)
Figure 3.9: P-Z Plot.

Please note that all the SNPs are plotted; for that reason, there may be some performance issues if there are many SNPs to plot.

A dataset with no issues will show perfect concordance. For further details please refer to Figure 3 of Winkler et al. (2014).

3.2.7.3 SE-N Plot

Finally, we will assess issues with trait transformations. For this we only need the results from the ds.metaGWAS function, which we pass to the function seNPlot.

seNPlot(results)
Figure 3.10: SE-N Plot.

If all the study points fall on a line, no issues are present. For further detail please refer to Figure 2 of Winkler et al. (2014).

3.2.8 Meta-analysis

Given that no problems have been detected in the meta-level QC, we will proceed. Up to this point, we have obtained association results for each cohort in the study. The next step is to combine this information using meta-analysis methods to derive a pooled estimate closest to the common truth. Each researcher might have an already built pipeline to do so, or a preferred method; nevertheless, we have included a couple of methods in dsOmicsClient (both are sketched in plain R after the list). They are the following:

  • Meta-analysis of p-values: Using the sum of logs method (Fisher’s method).
  • Meta-analysis of beta values: Using a fixed-effects model. Methodology extracted and adapted from Pirinen (2020).
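For intuition, both combination rules can be written in a few lines of plain R. The following is a simplified sketch of the standard formulas; the actual implementations in metaPvalues and metaBetaValues may differ in their details.

# Fisher's method: combine one SNP's p-values across k cohorts
fisher_p <- function(p) {
  stat <- -2 * sum(log(p))  # follows a chi-squared distribution with 2k degrees of freedom
  pchisq(stat, df = 2 * length(p), lower.tail = FALSE)
}

# Fixed-effects (inverse-variance) meta-analysis of one SNP's betas
fixed_effects <- function(beta, se) {
  w <- 1 / se^2               # inverse-variance weights
  b <- sum(w * beta) / sum(w) # pooled beta estimate
  s <- sqrt(1 / sum(w))       # pooled standard error
  p <- 2 * pnorm(-abs(b / s)) # two-sided p-value from the z-statistic
  c(beta = b, se = s, p.value = p)
}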

3.2.8.1 Assessment of the meta-analysis methods

Previously, we computed the same GWAS both using all the individuals together and in a simulated three-cohort scenario. This gives us the ability to assess how well the different meta-analysis methods recover the closest common truth. To do so, we will compare both meta-analyses to the common truth; the evaluation consists of counting the number of top hits that match the common truth, more specifically searching for matches among the top twenty hits.

We apply both methods to the multi-cohort use case results:

meta_pvalues <- metaPvalues(results)
meta_bvalues <- metaBetaValues(results)

We can now compare the 20 top hits from the common truth (previously stored in the variable results_single) to the 20 top hits yielded by both meta-analysis methodologies.

# Extract 20 top hits for all three cases (requires dplyr)
library(dplyr)
  # All individuals
top_20_complete <- results_single %>% arrange(p.value) %>% head(20)
  # P-value meta-analysis
top_20_pval <- meta_pvalues %>% arrange(p.meta) %>% head(20)
  # Beta meta-analysis
top_20_beta <- meta_bvalues %>% arrange(pval) %>% head(20)

# Number of hits shared by all individuals case and beta meta-analysis
sum(top_20_complete$rs %in% top_20_beta$rs)
[1] 13
# Number of hits shared by all individuals case and pvalue meta-analysis
sum(top_20_complete$rs %in% top_20_pval$rs)
[1] 1

The meta-analysis using beta values recovers a larger number of top SNPs; it can therefore be considered the more powerful method for GWAS meta-analyses.

3.2.8.2 Visualizing the results of the meta-analysis

We can use the Manhattan plot and LocusZoom to visualize the results of the meta-analysis, as we visualized the single cohorts before. To help assess the meta-analysis behaviour, we will plot the meta-analysis results alongside the common truth results.

3.2.8.2.1 Manhattan plot
manhattan(meta_bvalues)
Figure 3.11: Manhattan plot of the meta-analysis results.

manhattan(results_single)
Figure 3.12: Manhattan plot of the common truth.

3.2.8.2.2 LocusZoom
LocusZoom(meta_bvalues, pvalue = "p.F", use_biomaRt = FALSE)
Figure 3.13: LocusZoom plot of the meta-analysis results.

LocusZoom(results_single, use_biomaRt = FALSE)
Figure 3.14: LocusZoom plot of the common truth.

3.2.8.3 Meta-analysis versus pooled linear models

The dsOmicsClient package also bundles a function to fit pooled generalized linear models (GLMs) on single (or multiple) SNPs. In this section we will compare the results of the meta-analysis to the pooled GLMs to see which method is closer to the common truth. The meta-analysis method used will be the beta-values one (fixed-effects model).

The study methodology will be similar to the one we just performed: the top 20 hits yielded by the meta-analysis will be selected, and those SNPs will be studied using the pooled GLM function. Both results will be compared to the common truth.

With this study we aim to replicate a real scenario. Typically there is no access to the common truth (if there were, there would be no point in performing a meta-analysis on synthetic multi-cohorts). Therefore, we simulate the discovery of SNPs of interest via meta-analysis, and then assess whether studying those SNPs individually using pooled GLMs brings us closer to the common truth than what we had already achieved.

We have already calculated the top 20 SNPs discovered by the beta meta-analysis.

top_20_beta
# A tibble: 20 x 10
   rs        chr        pos    b.F   se.F        p.F    OR   low    up      pval
   <chr>     <chr>    <int>  <dbl>  <dbl>      <dbl> <dbl> <dbl> <dbl>     <dbl>
 1 rs124597~ 19      4.51e7 -0.530 0.0988    8.23e-8 0.589 0.485 0.714   8.23e-8
 2 rs956025  12      1.19e8 -0.406 0.0850    1.82e-6 0.667 0.564 0.787   1.82e-6
 3 rs8130070 21      4.57e7  0.398 0.0846    2.51e-6 1.49  1.26  1.76    2.51e-6
 4 rs8130914 21      4.68e7 -0.879 0.192     4.41e-6 0.415 0.285 0.604   4.41e-6
 5 rs9980256 21      4.24e7 -0.389 0.0852    4.92e-6 0.677 0.573 0.801   4.92e-6
 6 rs177569~ 14      6.92e7 -0.613 0.134     5.19e-6 0.542 0.416 0.705   5.19e-6
 7 rs112352~ 10      9.88e7  0.766 0.169     6.05e-6 2.15  1.54  3.00    6.05e-6
 8 rs721891  19      5.23e7 -0.491 0.109     6.11e-6 0.612 0.494 0.757   6.11e-6
 9 rs124233~ 12      1.24e7  0.710 0.158     6.77e-6 2.03  1.49  2.77    6.77e-6
10 rs9722322 9       2.79e7  0.345 0.0771    7.49e-6 1.41  1.21  1.64    7.49e-6
11 rs122370~ 9       1.72e7  0.499 0.111     7.50e-6 1.65  1.32  2.05    7.50e-6
12 rs169114~ 10      5.97e7  0.681 0.152     7.74e-6 1.97  1.47  2.66    7.74e-6
13 rs7714862 5       2.39e7  0.375 0.0842    8.30e-6 1.46  1.23  1.72    8.30e-6
14 rs8066656 17      7.65e7 -0.551 0.124     9.28e-6 0.576 0.452 0.735   9.28e-6
15 rs101939~ 2       1.39e8 -0.439 0.0992    9.60e-6 0.645 0.531 0.783   9.60e-6
16 rs101853~ 2       1.39e8 -0.409 0.0927    9.99e-6 0.664 0.554 0.796   9.99e-6
17 rs109558~ 8       1.19e8 -0.333 0.0754    1.01e-5 0.717 0.618 0.831   1.01e-5
18 rs581815  2       2.25e8 -0.343 0.0778    1.06e-5 0.710 0.609 0.827   1.06e-5
19 rs985355  18      2.70e7 -0.359 0.0816    1.10e-5 0.699 0.595 0.820   1.10e-5
20 rs313728  1       8.62e7 -0.389 0.0886    1.11e-5 0.677 0.569 0.806   1.11e-5

Next, we have to fit pooled GLMs using the ds.glmSNP function. Obviously, we have to fit the same model that we used in the GWAS (diabetes_diagnosed_doctor ~ sex + hdl_cholesterol); otherwise, the results will not be consistent. Remember that we have the genotype data separated by chromosome; accordingly, we construct the GLM calls as follows:

s <- lapply(1:nrow(top_20_beta), function(x){
  tryCatch({
    ds.glmSNP(snps.fit = top_20_beta[x, 1], model = diabetes_diagnosed_doctor ~ sex + hdl_cholesterol, genoData = paste0("gds.Data", top_20_beta[x, 2]))
  }, error = function(w){NULL})
})
# Simple data cleaning and column name changes (rownames_to_column requires tibble)
library(tibble)
ss <- data.frame(do.call(rbind, s))
ss <- ss %>% rownames_to_column("rs")
pooled <- ss %>% dplyr::rename(Est.pooled = Estimate, SE.pooled = Std..Error, pval.pooled = p.value) %>% dplyr::select(-c(n, p.adj))
# ds.glmSNP(snps.fit = unlist(top_20_beta[3:5, 1]), model = diabetes_diagnosed_doctor ~ sex + hdl_cholesterol, genoData = paste0("gds.Data", unlist(top_20_beta[3:5, 2])), mc.cores = 3)

We now extract the results of interest from the meta-analysis and the common truth.

# Get data from all individuals study
original <- results_single %>% dplyr::rename(pval.original = p.value, Est.original = Est, SE.original = Est.SE) %>%
  dplyr::select(c(rs, Est.original, SE.original, pval.original))
# Arrange data from beta meta-analysis
beta <- top_20_beta %>% dplyr::rename(pval.beta = pval, Est.beta = b.F, SE.beta = se.F) %>%
  dplyr::select(c(rs, Est.beta, SE.beta, pval.beta))
# Merge all data on a single table
total <- Reduce(function(x,y){inner_join(x,y, by = "rs")}, list(original, pooled, beta))

With all the information gathered, we can visualize the results in Table 4.

Table 4: Beta values, standard errors and p-values yielded by the three methods: GWAS of all individuals, meta-analysis of synthetic cohorts, pooled analysis of synthetic cohorts.

SNP id Beta SE P-Value
Original Pooled Meta Original Pooled Meta Original Pooled Meta
rs313728 −0.375 −0.385 −0.389 0.085 0.088 0.089 1.1 × 10−5 1.2 × 10−5 1.1 × 10−5
rs10193985 −0.444 −0.446 −0.439 0.098 0.101 0.099 5.5 × 10−6 1.1 × 10−5 9.6 × 10−6
rs10185343 −0.410 −0.415 −0.409 0.091 0.094 0.093 6.8 × 10−6 1.1 × 10−5 1.0 × 10−5
rs581815 −0.341 −0.344 −0.343 0.077 0.079 0.078 8.4 × 10−6 1.2 × 10−5 1.1 × 10−5
rs7714862 0.376 0.371 0.375 0.081 0.084 0.084 3.2 × 10−6 1.0 × 10−5 8.3 × 10−6
rs10955881 −0.331 −0.331 −0.333 0.073 0.075 0.075 6.5 × 10−6 1.2 × 10−5 1.0 × 10−5
rs9722322 0.353 0.345 0.345 0.076 0.078 0.077 3.3 × 10−6 9.7 × 10−6 7.5 × 10−6
rs12237062 0.497 0.499 0.499 0.108 0.113 0.111 4.3 × 10−6 1.1 × 10−5 7.5 × 10−6
rs112352624 0.685 0.814 0.766 0.160 0.186 0.169 1.9 × 10−5 1.2 × 10−5 6.0 × 10−6
rs956025 −0.406 −0.405 −0.406 0.083 0.086 0.085 9.4 × 10−7 2.3 × 10−6 1.8 × 10−6
rs12423318 0.720 0.729 0.710 0.154 0.166 0.158 3.0 × 10−6 1.1 × 10−5 6.8 × 10−6
rs17756914 −0.621 −0.639 −0.613 0.133 0.141 0.134 2.8 × 10−6 5.8 × 10−6 5.2 × 10−6
rs8066656 −0.550 −0.565 −0.551 0.123 0.131 0.124 8.1 × 10−6 1.7 × 10−5 9.3 × 10−6
rs985355 −0.376 −0.366 −0.359 0.080 0.083 0.082 2.7 × 10−6 9.7 × 10−6 1.1 × 10−5
rs12459712 −0.510 −0.528 −0.530 0.096 0.101 0.099 9.9 × 10−8 1.6 × 10−7 8.2 × 10−8
rs721891 −0.482 −0.506 −0.491 0.108 0.113 0.109 8.0 × 10−6 7.7 × 10−6 6.1 × 10−6
rs8130070 0.401 0.399 0.398 0.082 0.085 0.085 1.1 × 10−6 2.8 × 10−6 2.5 × 10−6
rs9980256 −0.392 −0.388 −0.389 0.082 0.086 0.085 2.0 × 10−6 5.9 × 10−6 4.9 × 10−6
rs8130914 −0.883 −1.028 −0.879 0.191 0.227 0.192 3.6 × 10−6 5.7 × 10−6 4.4 × 10−6

Finally, to conclude this study, we can compute the bias and mean squared error (code not shown; a sketch is given after the table). The results are shown in Table 5.

Table 5: Mean squared error and bias for the betas from the meta-analysis and pooled analysis compared to the common truth (all individuals).

                 MSE           Bias
Beta
  Pooled    2.1 × 10−3     5.2 × 10−3
  Meta      4.2 × 10−4    −2.7 × 10−3
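Although the exact code is not shown in the book, the computation behind Table 5 can be sketched from the merged total table built above along these lines:

# Bias and MSE of each method's betas against the common truth (sketch)
total %>% summarise(
  MSE.pooled  = mean((Est.pooled - Est.original)^2),
  Bias.pooled = mean(Est.pooled - Est.original),
  MSE.meta    = mean((Est.beta - Est.original)^2),
  Bias.meta   = mean(Est.beta - Est.original)
)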

With these results we can conclude that refining the statistics for the SNPs of interest using pooled GLMs is advisable. Therefore, the recommended flowchart when performing a GWAS using dsOmicsClient + DataSHIELD is the following:


  1. Perform the GWAS with dsOmicsClient::ds.metaGWAS
  2. Perform meta-analysis on the results with dsOmicsClient::metaBetaValues
  3. Extract top hits from the meta-analysis.
  4. Perform pooled GLM on top hits with dsOmicsClient::ds.glmSNP

3.2.9 Fast GWAS analysis

An alternative to performing a meta-analysis is the fast GWAS methodology. This method is named after the performance it achieves; however, it provides advantages beyond computational speed. The main one is that it does not compute results for each cohort to then be meta-analyzed; instead, it uses a pooled technique that yields exactly the same results whether all the individuals are analyzed together or separated by cohorts. This is of great importance, as there is no need to worry about cohort imbalance, which is usually a problem in multi-cohort meta-analysis studies. The only downside is that it does not yield exactly the same results as a traditional GWAS analysis pipeline; the inaccuracy, however, is very low, and detailed information about it can be found in the paper that proposes the method.

The usage is similar to the meta-GWAS with a couple of added options; feel free to check the documentation with ?ds.fastGWAS for information about them.

results_fast <- ds.fastGWAS(genoData = paste0("gds.Data", 1:21), formula = "diabetes_diagnosed_doctor ~ sex + hdl_cholesterol", family = "binomial", snpBlock = 5000)
datashield.logout(conns)

3.2.9.1 GWAS Plot

We can display the results of the fast GWAS using a Manhattan plot.

manhattan(results_fast, pvalCol = 4)
Figure 3.15: Manhattan plot of the fast GWAS results.

3.2.9.2 Meta-analysis versus fast GWAS

To finish the GWAS section, we can evaluate the results of the beta meta-analysis against the results of the fast GWAS. Both methods will be compared to the ground truth.

As before, we can start by counting the number of top-hit SNPs that each methodology recovers.

# Extract 20 top hits for all three cases
  # All individuals
top_20_complete <- results_single %>% arrange(p.value) %>% head(20)
  # Fast GWAS
top_20_fast <- results_fast %>% arrange(p.value) %>% head(20)
  # Beta meta-analysis
top_20_beta <- meta_bvalues %>% arrange(pval) %>% head(20)

# Number of hits shared by all individuals case and beta meta-analysis
sum(top_20_complete$rs %in% top_20_beta$rs)
[1] 13
# Number of hits shared by all individuals case and fast GWAS
sum(top_20_complete$rs %in% top_20_fast$rs)
[1] 11

In that regard, it seems the meta-analysis performs better.

We now extract the results of interest from the fast GWAS, the meta-analysis and the common truth.

# Get data from all individuals study
original <- results_single %>% dplyr::rename(pval.original = p.value, Est.original = Est, SE.original = Est.SE) %>%
  dplyr::select(c(rs, Est.original, SE.original, pval.original))
# Arrange data from beta meta-analysis
beta <- top_20_beta %>% dplyr::rename(pval.beta = pval, Est.beta = b.F, SE.beta = se.F) %>%
  dplyr::select(c(rs, Est.beta, SE.beta, pval.beta))
# Arrange data from fast GWAS
fast <- top_20_fast %>% dplyr::rename(pval.fast = p.value, Est.fast = Est, SE.fast = Est.SE) %>%
  dplyr::select(c(rs, Est.fast, SE.fast, pval.fast))
# Merge all data on a single table
total <- Reduce(function(x,y){inner_join(x,y, by = "rs")}, list(original, fast, beta))

With all the information gathered, we can visualize the results in Table 6.

Table 6: Beta values, standard errors and p-values yielded by the three methods: GWAS of all individuals, meta-analysis of synthetic cohorts, fast GWAS.

                         Beta                     SE                    P-Value
SNP id         Original   Fast    Meta   Original  Fast   Meta   Original    Fast        Meta
rs7714862        0.376   0.380   0.375     0.081  0.082  0.084   3.2 × 10−6  3.1 × 10−6  8.3 × 10−6
rs9722322        0.353   0.349   0.345     0.076  0.076  0.077   3.3 × 10−6  4.6 × 10−6  7.5 × 10−6
rs12237062       0.497   0.515   0.499     0.108  0.107  0.111   4.3 × 10−6  1.6 × 10−6  7.5 × 10−6
rs16911403       0.684   0.644   0.681     0.146  0.144  0.152   3.1 × 10−6  7.3 × 10−6  7.7 × 10−6
rs112352624      0.685   0.732   0.766     0.160  0.157  0.169   1.9 × 10−5  3.3 × 10−6  6.0 × 10−6
rs956025        −0.406  −0.385  −0.406     0.083  0.082  0.085   9.4 × 10−7  2.5 × 10−6  1.8 × 10−6
rs12423318       0.720   0.699   0.710     0.154  0.146  0.158   3.0 × 10−6  1.8 × 10−6  6.8 × 10−6
rs12459712      −0.510  −0.521  −0.530     0.096  0.094  0.099   9.9 × 10−8  3.5 × 10−8  8.2 × 10−8
rs8130070        0.401   0.401   0.398     0.082  0.082  0.085   1.1 × 10−6  1.2 × 10−6  2.5 × 10−6
rs9980256       −0.392  −0.380  −0.389     0.082  0.083  0.085   2.0 × 10−6  5.1 × 10−6  4.9 × 10−6

Finally, to conclude this study we can compute the bias and mean squared error (code not shown). The results are shown in Table 7.

Table 7: Mean squared error and bias for the betas from the meta-analysis and fast GWAS compared to the common truth (all individuals).

       MSE         Bias
Fast   5.3 × 10−4  −2.6 × 10−3
Meta   7.2 × 10−4  −4.2 × 10−3
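
For reference, one way these quantities could be computed from the merged total table is sketched below (the original code is not shown, so the exact computation and sign convention are assumptions):

library(dplyr)
# MSE and bias of each method's betas against the all-individuals betas
total %>% summarise(
  MSE.fast  = mean((Est.fast - Est.original)^2),
  Bias.fast = mean(Est.fast - Est.original),
  MSE.meta  = mean((Est.beta - Est.original)^2),
  Bias.meta = mean(Est.beta - Est.original)
)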

In this brief comparison we have seen that the fast GWAS recovers slightly fewer of the ground truth SNPs, but it has better bias and MSE. It must be noted that the data used for these GWAS examples does not suffer from any cohort imbalance, so the meta-analysis is expected to yield proper results; in an imbalanced setting, the fast GWAS is expected to outperform the meta-analysis more markedly.

3.3 Post omic analysis

Up to this point we have analyzed the genotype and phenotype data by performing a genome-wide association study. We can further study the results by performing an enrichment analysis.

3.3.1 Enrichment analysis

The GWAS results are a collection of SNPs that displayed a relationship with a given phenotype. Those SNPs may fall inside gene regions, and those genes can be functionally related. Studying this can help researchers understand the underlying biological processes. These analyses are carried out by querying the selected genes against a database; for this example we will be using the DisGeNET database.

3.3.1.1 Annotation of genes

The first step of the enrichment analysis is to annotate the genes that contain the SNPs of interest yielded by the GWAS. There are many ways to perform this step; we propose using the biomaRt package to retrieve the annotation from the Ensembl genome browser.

First we select the SNPs of interest by applying a p-value threshold.

snps_of_interest <- results_fast %>% dplyr::arrange(p.value) %>% dplyr::select(chr, pos, rs, p.value)  %>% dplyr::filter(p.value < 0.00005) 

And with this list of SNPs we can move to the annotation.

library(biomaRt)
# Perform ensembl connection
gene.ensembl <- useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl", GRCh = 37)
# Query SNP by SNP
genes_interest <- Reduce(rbind, apply(snps_of_interest, 1, function(x){
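  # x[1] is the chromosome and x[2] the position (order given by the select above)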
  result <- getBM(
    attributes = c('entrezgene_id', 'chromosome_name', 'ensembl_gene_id', 'start_position', "end_position"), 
    filters = c('start', 'end', 'chromosome_name'), 
    values = list(start = x[2], end = x[2], chromosome_name = x[1]), 
    mart = gene.ensembl)
}))

3.3.1.2 Enrichment analysis query

Having the genes of interest annotated, the final step is to query them against DisGeNET. We will use the function enrichDGN from the package DOSE for this. This package is part of a large ecosystem of packages dedicated to enrichment analysis, which allows for easy representation of the query results.

library(DOSE)
enrichment <- enrichDGN(unlist(genes_interest$entrezgene_id))
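
The result is an enrichResult object; its table of enriched disease terms can be inspected by coercing it to a data frame (a quick check we add here, not part of the original workflow):

# Top enriched disease terms from the DisGeNET query
head(as.data.frame(enrichment))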

3.3.1.3 Enrichment analysis plots

Here we show some of the visualizations that can be performed with the query results. For more information and further visualization options, please read the official vignette.

3.3.1.3.1 Barplot

# library(enrichplot)
# barplot(enrichment)

3.3.1.3.2 Dotplot

# dotplot(enrichment)

3.3.1.3.3 CNET plot

# library(ggnewscale)
# # convert gene ID to Symbol
# edox <- setReadable(enrichment, 'org.Hs.eg.db', 'ENTREZID')
# cnetplot(edox, foldChange=genes_interest$entrezgene_id)

3.4 Polygenic risk scores

3.4.1 Definition

By checking for specific variants and quantifying them, a researcher can compute the polygenic risk score (PRS) of an individual, which reflects that individual's associated risk for a concrete phenotype. A definition of the term reads:

“A weighted sum of the number of risk alleles carried by an individual, where the risk alleles and their weights are defined by the loci and their measured effects as detected by genome wide association studies.” (Extracted from Torkamani, Wineinger, and Topol (2018))

The use of PRS markers is very popular nowadays, and it has proven to be a good tool to assess associated risks (Escott-Price et al. 2017; Forgetta et al. 2020; Kramer et al. 2020).

When calculating the PRS using DataSHIELD, the actual scores are not returned, as being able to determine the risk of an individual for a certain disease would not be GDPR compliant. The results are instead stored on the servers and can optionally be merged into the phenotypes table. This allows further study of the PRS using summaries or pooled GLM models to assess relationships between phenotypes and PRS scores.
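
For instance, once the scores have been merged into the server-side phenotypes table, a pooled GLM could be fitted with ds.glm from dsBaseClient. The sketch below is illustrative only: the table name pheno and the column name prs are hypothetical, and the exact merge step depends on the options chosen when computing the PRS.

# Hypothetical pooled GLM relating a phenotype to the PRS; "pheno" and
# "prs" are assumed names, not fixed by the PRS functions
mod <- ds.glm(formula = "diabetes_diagnosed_doctor ~ prs + sex",
              data = "pheno", family = "binomial", datasources = conns)
mod$coefficients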

3.4.2 Study configuration

The PRS use case will be set up as a multi-cohort study, using the same dataset as the multi-cohort GWAS (same VCF resources). A single-cohort PRS could be performed analogously, following the single-cohort example of the GWAS.

Figure 3.16: Proposed infrastructure to perform PRS studies.

3.4.3 Source of polygenic scores

There are two ways of specifying the SNPs of interest and their weights in order to calculate the PRS.

  • Providing a prs_table (a table of the SNPs of interest). The prs_table has to have a defined structure and column names in order to be understood by this function. It can be formulated using two schemes:

    • Scheme 1: Provide SNP positions. Use the column names: chr_name, chr_position, effect_allele, reference_allele, effect_weight.

      chr_name  chr_position  effect_allele  reference_allele  effect_weight
      1         100880328     T              A                  0.0373
      1         121287994     G              A                 -0.0673
      6         152023191     A              G                  0.0626

    • Scheme 2: Provide SNP IDs. Use the column names: rsID, effect_allele, reference_allele, effect_weight.

      rsID        effect_allele  reference_allele  effect_weight
      rs11206510  T              C                 0.0769
      rs4773144   G              A                 0.0676
      rs17514846  A              C                 0.0487

The effect_weight values have to be betas (log(OR)).

  • Providing a PGS Catalog ‘Polygenic Score ID & Name.’ If this option is used, the SNPs and beta weights will be downloaded from the PGS Catalog and used to calculate the PRS.

Please note, when using a custom prs_table, that Scheme 1 is much faster: subsetting the VCF files by chromosome name and position is considerably faster than doing so by SNP ID. A sketch of a Scheme 1 table built in R is shown below.
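
As an illustration, a Scheme 1 prs_table could be assembled in R as follows (using the values from the example above; only the column names are mandatory, the variable name is not):

# Minimal Scheme 1 prs_table; column names must match exactly
prs_table <- data.frame(
  chr_name         = c(1, 1, 6),
  chr_position     = c(100880328, 121287994, 152023191),
  effect_allele    = c("T", "G", "A"),
  reference_allele = c("A", "A", "G"),
  effect_weight    = c(0.0373, -0.0673, 0.0626)  # betas, i.e. log(OR)
)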

3.4.4 Connection to the Opal server

We have to create an Opal connection object to the cohort servers. We do that using the following functions.
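
The following sketch mirrors the login code used in the other multi-cohort use cases of this book (same demo server and credentials; in a real deployment each builder$append would point to a different cohort's server):

require('DSI')
require('DSOpal')
require('dsBaseClient')
require('dsOmicsClient')

builder <- DSI::newDSLoginBuilder()
builder$append(server = "cohort1", url = "https://opal-demo.obiba.org/",
               user =  "dsuser", password = "P@ssw0rd",
               driver = "OpalDriver", profile = "omics")
builder$append(server = "cohort2", url = "https://opal-demo.obiba.org/",
               user =  "dsuser", password = "P@ssw0rd",
               driver = "OpalDriver", profile = "omics")
logindata <- builder$build()
conns <- DSI::datashield.login(logins = logindata)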

4 Differential gene expression (DGE) analysis

In this section we will illustrate how to perform transcriptomic data analysis using data from the TCGA project. We have uploaded to the demo Opal server a resource called tcga_liver, whose URL is http://duffel.rail.bio/recount/TCGA/rse_gene_liver.Rdata and which is available through the recount project. This resource contains a RangedSummarizedExperiment (RSE) with the RNAseq profiling of liver cancer data from TCGA. The resource is located inside the OMICS project. Some summaries of the dataset are the following:

TCGA Liver data
Number of individuals 424
Number of genes 58,037
Number of covariate fields 864

The structure used is illustrated on the following figure.

Figure 4.1: Proposed infrastructure to perform DGE studies.

The data analyst corresponds to the “RStudio” session, which through DataSHIELD Interface (DSI) connects with the Opal server. This Opal server contains a resource that corresponds to the TCGA liver RSE.

We illustrate a differential expression analysis comparing the RNAseq profiling of women vs men (variable gdc_cases.demographic.gender). DGE analysis is normally performed using the limma package; in this case, as we are analyzing RNA-seq data, the voom + limma method is required.

The use cases are illustrated step by step in the following sections.

4.1 Connection to the Opal server

We have to create an Opal connection object to the cohort server. We do that using the following functions.

require('DSI')
require('DSOpal')
require('dsBaseClient')
require('dsOmicsClient')

builder <- DSI::newDSLoginBuilder()
builder$append(server = "cohort1", url = "https://opal-demo.obiba.org/",
               user =  "dsuser", password = "P@ssw0rd",
               driver = "OpalDriver", profile = "omics")
logindata <- builder$build()
conns <- DSI::datashield.login(logins = logindata)

4.2 Assign the RSE resource

Now that we have created a connection object to the Opal, we have started a new R session on the server. Our analysis will take place in this remote session, so we have to load the data into it.

In this use case we will use one resource from the OMICS project hosted on the demo Opal server. This resource corresponds to a RangedSummarizedExperiment dataset. The name of the resource is tcga_liver; following the Opal syntax, we will refer to it using the string OMICS.tcga_liver.

DSI::datashield.assign.resource(conns, "rse_resource", "OMICS.tcga_liver")

Now we have assigned the resource named OMICS.tcga_liver into our remote R session, under a variable called rse_resource. To verify this step has been performed correctly, we can use the ds.class function to check its class and that it exists on the remote session.

ds.class("rse_resource")
$cohort1
[1] "RDataFileResourceClient" "FileResourceClient"     
[3] "ResourceClient"          "R6"                     

We can see that the object rse_resource exists in the server.

Finally the resource is resolved to retrieve the data in the remote session.

DSI::datashield.assign.expr(conns = conns, symbol = "rse",
                            expr = as.symbol("as.resource.object(rse_resource)"))

Now we have resolved the resource named rse_resource into our remote R session. The object retrieved has been assigned into the variable named rse. We can check the process was successful as we did before.

ds.class("rse")
$cohort1
[1] "RangedSummarizedExperiment"
attr(,"package")
[1] "SummarizedExperiment"

4.3 Inspect the RSE

The number of features and samples can be inspected with:

ds.dim("rse")
$`dimensions of rse in cohort1`
[1] 58037   424

$`dimensions of rse in combined studies`
[1] 58037   424

And the names of the features can be listed using the same function used when analyzing an ExpressionSet:

name.features <- ds.featureNames("rse")
lapply(name.features, head)
$cohort1
[1] "ENSG00000000003.14" "ENSG00000000005.5"  "ENSG00000000419.12"
[4] "ENSG00000000457.13" "ENSG00000000460.16" "ENSG00000000938.12"

Also the covariate names can be inspected by:

name.vars <- ds.featureData("rse")
lapply(name.vars, head, n=15)
$cohort1
 [1] "project"                                       
 [2] "sample"                                        
 [3] "experiment"                                    
 [4] "run"                                           
 [5] "read_count_as_reported_by_sra"                 
 [6] "reads_downloaded"                              
 [7] "proportion_of_reads_reported_by_sra_downloaded"
 [8] "paired_end"                                    
 [9] "sra_misreported_paired_end"                    
[10] "mapped_read_count"                             
[11] "auc"                                           
[12] "sharq_beta_tissue"                             
[13] "sharq_beta_cell_type"                          
[14] "biosample_submission_date"                     
[15] "biosample_publication_date"                    

We can visualize the levels of the gender variable, which will be our condition (i.e., we are interested in obtaining genes that are differentially expressed between males and females).

ds.table1D("rse$gdc_cases.demographic.gender")
$counts
       rse$gdc_cases.demographic.gender
female                              143
male                                281
Total                               424

$percentages
       rse$gdc_cases.demographic.gender
female                            33.73
male                              66.27
Total                            100.00

$validity
[1] "All tables are valid!"

4.4 Pre-processing for RNAseq data

We have implemented a function called ds.RNAseqPreproc() to perform RNAseq data pre-processing that includes:

  • Transforming data into log2 CPM units
  • Filtering lowly-expressed genes
  • Data normalization

ds.RNAseqPreproc('rse', group = 'gdc_cases.demographic.gender', 
                 newobj.name = 'rse.pre')

Note that it is recommended to indicate the grouping variable (i.e., condition). Once the data has been pre-processed, we can perform the differential expression analysis. Notice how the dimensions have changed, given that we have removed genes with low expression, which are not expected to be differentially expressed.

ds.dim('rse')
$`dimensions of rse in cohort1`
[1] 58037   424

$`dimensions of rse in combined studies`
[1] 58037   424
ds.dim('rse.pre')
$`dimensions of rse.pre in cohort1`
[1] 40363   424

$`dimensions of rse.pre in combined studies`
[1] 40363   424

4.5 DGE analysis

The differential expression analysis in dsOmicsClient/dsOmics is implemented in the function ds.limma(). This function runs a limma pipeline for microarray data, and for RNAseq data it allows:

  • voom + limma
  • DESeq2
  • edgeR

We recommend the voom + limma pipeline proposed here, given its versatility and the fact that limma is much faster than DESeq2 and edgeR. By default, the function considers that data is obtained from a microarray experiment (type.data = "microarray"). Therefore, as we are analyzing RNAseq data, we must indicate type.data = "RNAseq".

ans.gender <- ds.limma(model =  ~ gdc_cases.demographic.gender, 
                       Set = "rse.pre", type.data = "RNAseq")

The top differentially expressed genes can be visualized by:

ans.gender
$cohort1
# A tibble: 40,363 x 7
   id                     n   beta     SE     t   P.Value adj.P.Val
   <chr>              <int>  <dbl>  <dbl> <dbl>     <dbl>     <dbl>
 1 ENSG00000274655.1    424 -12.4  0.0761 -52.1 2.74e-187 1.11e-182
 2 ENSG00000270641.1    424 -10.2  0.461  -43.8 5.21e-160 1.05e-155
 3 ENSG00000229807.10   424 -11.0  0.0603 -39.5 1.08e-144 1.45e-140
 4 ENSG00000277577.1    424 -11.3  0.0651 -36.0 2.27e-131 2.29e-127
 5 ENSG00000233070.1    424  10.9  0.0885  35.5 1.85e-129 1.49e-125
 6 ENSG00000260197.1    424  10.2  0.118   32.9 3.72e-119 2.50e-115
 7 ENSG00000213318.4    424  11.4  0.128   31.9 5.57e-115 3.21e-111
 8 ENSG00000278039.1    424  -7.78 0.0812 -28.8 3.85e-102 1.94e- 98
 9 ENSG00000067048.16   424   9.62 0.0894  27.4 4.72e- 96 2.12e- 92
10 ENSG00000131002.11   424  11.4  0.0924  27.3 9.63e- 96 3.89e- 92
# ... with 40,353 more rows

attr(,"class")
[1] "dsLimma" "list"   

We can verify whether the distribution of the observed p-values is the one we expect in this type of analysis:

hist(ans.gender$cohort1$P.Value, xlab="Raw p-value gender effect",
     main="", las=1, cex.lab=1.5, cex.axis=1.2, col="gray")

4.6 Surrogate variable analysis

We can also check whether there is inflation by executing:

qqplot(ans.gender$cohort1$P.Value)

In this case, the model needs to remove unwanted variability (\(\lambda > 2\)). We can use surrogate variable analysis by simply setting the argument sva = TRUE.

ans.gender.sva <- ds.limma(model =  ~ gdc_cases.demographic.gender, 
                       Set = "rse.pre", type.data = "RNAseq",
                       sva = TRUE)

Now the inflation has been dramatically reduced (\(\lambda = 1.12\)).

qqplot(ans.gender.sva$cohort1$P.Value)
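
For reference, the inflation factor (\(\lambda\)) quoted above can be estimated from the raw p-values using the usual median chi-squared definition. This is a sketch for intuition, not necessarily how the package computes it:

# Genomic inflation factor: observed median chi-squared statistic divided
# by its expected value under the null (qchisq(0.5, df = 1) ~ 0.456)
lambda <- function(p) {
  chisq <- qchisq(p, df = 1, lower.tail = FALSE)
  median(chisq) / qchisq(0.5, df = 1)
}
lambda(ans.gender$cohort1$P.Value)      # before SVA
lambda(ans.gender.sva$cohort1$P.Value)  # after SVA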

We can add to the output the annotation available in our RSE object. The available columns can be listed by:

ds.fvarLabels('rse.pre')
$cohort1
[1] "chromosome" "start"      "end"        "width"      "strand"    
[6] "gene_id"    "bp_length"  "symbol"    

attr(,"class")
[1] "dsfvarLabels" "list"        

So, we can run:

ans.gender.sva <- ds.limma(model =  ~ gdc_cases.demographic.gender, 
                       Set = "rse.pre", type.data = "RNAseq",
                       sva = TRUE, annotCols = c("chromosome"))

The results are:

ans.gender.sva
$cohort1
# A tibble: 40,363 x 8
   id                     n   beta     SE     t   P.Value adj.P.Val chromosome
   <chr>              <int>  <dbl>  <dbl> <dbl>     <dbl>     <dbl> <chr>     
 1 ENSG00000274655.1    424 -12.4  0.0761 -52.1 2.74e-187 1.11e-182 chrX      
 2 ENSG00000270641.1    424 -10.2  0.461  -43.8 5.21e-160 1.05e-155 chrX      
 3 ENSG00000229807.10   424 -11.0  0.0603 -39.5 1.08e-144 1.45e-140 chrX      
 4 ENSG00000277577.1    424 -11.3  0.0651 -36.0 2.27e-131 2.29e-127 chrX      
 5 ENSG00000233070.1    424  10.9  0.0885  35.5 1.85e-129 1.49e-125 chrY      
 6 ENSG00000260197.1    424  10.2  0.118   32.9 3.72e-119 2.50e-115 chrY      
 7 ENSG00000213318.4    424  11.4  0.128   31.9 5.57e-115 3.21e-111 chr16     
 8 ENSG00000278039.1    424  -7.78 0.0812 -28.8 3.85e-102 1.94e- 98 chrX      
 9 ENSG00000067048.16   424   9.62 0.0894  27.4 4.72e- 96 2.12e- 92 chrY      
10 ENSG00000131002.11   424  11.4  0.0924  27.3 9.63e- 96 3.89e- 92 chrY      
# ... with 40,353 more rows

attr(,"class")
[1] "dsLimma" "list"   

The function has other arguments that can be used to fit other types of models:

  • sva: estimate surrogate variables
  • annotCols: add annotation columns available in the ExpressionSet or RangedSummarizedExperiment
  • method: Linear regression (“ls”) or robust regression (“robust”) used in limma (lmFit)
  • robust: robust method used for outlier sample variances used in limma (eBayes)
  • normalization: normalization method used in the voom transformation (default “none”)
  • voomQualityWeights: should voomQualityWeights function be used instead of voom? (default FALSE)
  • big: should SmartSVA be used instead of SVA (useful for big sample size or when analyzing epigenome data. Default FALSE)

We have also implemented two other functions, ds.DESeq2 and ds.edgeR, that perform DGE analysis using the DESeq2 and edgeR methods. This is the R code used for that purpose:

To be supplied

We close the DataSHIELD session by:

datashield.logout(conns)

5 Epigenome-wide association analysis (EWAS)

In this section we will illustrate how to perform an epigenome-wide association study (EWAS) using methylation data. EWAS requires basically the same statistical methods as those used in DGE. It should be noted that the pooled analysis we are going to illustrate here can also be performed with transcriptomic data; however, since each study may have a different range of values, gene expression harmonization should be performed first, for instance by standardizing the data at each study. For EWAS, where methylation is measured using beta values (i.e., CpG data are in the range 0-1), this is not a problem. In any case, adopting the meta-analysis approach is a safe option.

The data used in this section has been downloaded from GEO (accession number GSE66351) and contains DNA methylation profiling (Illumina 450K array). The data corresponds to CpG beta values measured in the superior temporal gyrus and prefrontal cortex brain regions of patients with Alzheimer's disease. The data is encapsulated in ExpressionSet objects. Researchers who are not familiar with ExpressionSet can find information here. Notice that the data is encoded as beta values, which ensures data harmonization across studies.

In order to illustrate how to perform data analyses using federated data, we have split the data into two synthetic cohorts (split by individuals). We have created two resources on the demo Opal server called GSE66351_1 and GSE66351_2 respectively. They can be found inside the OMICS project. Some summaries of the datasets are the following:

                               Cohort 1  Cohort 2  Total
Number of CpGs                 481,868   481,868   481,868
Number of individuals          100       90        190
Number of covariates           49        49        49
Number of annotation features  37        37        37

The structure used is illustrated on the following figure.

Figure 5.1: Proposed infrastructure to perform EWAS studies.

The data analyst corresponds to the “RStudio” session, which through DataSHIELD Interface (DSI) connects with the Opal server. Each Opal server contains a resource corresponding to the GEO:GSE66351 ExpressionSet (subset by individuals).

We will illustrate the use cases step by step in the following sections.

5.1 Initial steps for all use cases

5.1.1 Connection to the Opal server

We have to create an Opal connection object to each cohort server. We do that using the following functions.

require('DSI')
require('DSOpal')
require('dsBaseClient')
require('dsOmicsClient')

builder <- DSI::newDSLoginBuilder()
builder$append(server = "cohort1", url = "https://opal-demo.obiba.org/",
               user =  "dsuser", password = "P@ssw0rd",
               driver = "OpalDriver", profile = "omics")
builder$append(server = "cohort2", url = "https://opal-demo.obiba.org/",
               user =  "dsuser", password = "P@ssw0rd",
               driver = "OpalDriver", profile = "omics")
logindata <- builder$build()
conns <- DSI::datashield.login(logins = logindata)

It is important to note that in this use case we are only using one server (https://opal-demo.obiba.org/), which hosts all the resources corresponding to the different cohorts. In a more realistic scenario, each of the builder$append instructions would connect to a different server.

5.1.2 Assign the ExpressionSet resource

Now that we have created a connection object to the different Opals, we have started two R sessions. Our analysis will take place in those remote sessions, so we have to load the data into them.

In this use case we will use 2 different resources from the OMICS project hosted on the demo Opal server. The names of the resources are GSE66351_X (where X is the cohort identifier 1/2). Following the Opal syntax, we will refer to them using the string OMICS.GSE66351_X.

We have to refer specifically to each server by using conns[X]; this allows us to communicate with the server of interest and indicate the resource it has to load.

# Cohort 1 resource load
DSI::datashield.assign.resource(conns[1], "eSet_resource", "OMICS.GSE66351_1")

# Cohort 2 resource load
DSI::datashield.assign.resource(conns[2], "eSet_resource", "OMICS.GSE66351_2")

Now we have assigned the resources into our remote R sessions, under variables called eSet_resource. To verify this step has been performed correctly, we can use the ds.class function to check their class and that they exist on the remote sessions.

ds.class("eSet_resource")
$cohort1
[1] "RDataFileResourceClient" "FileResourceClient"     
[3] "ResourceClient"          "R6"                     

$cohort2
[1] "RDataFileResourceClient" "FileResourceClient"     
[3] "ResourceClient"          "R6"                     

We can see that the object eSet_resource exists in both servers.

Finally the resource is resolved to retrieve the data in the remote sessions.

DSI::datashield.assign.expr(conns = conns, symbol = "eSet",
                            expr = as.symbol("as.resource.object(eSet_resource)"))

Now we have resolved the resource named eSet_resource into our remote R sessions. The retrieved objects have been assigned to the variable named eSet. We can check that the process was successful as we did before.

ds.class("eSet")
$cohort1
[1] "ExpressionSet"
attr(,"package")
[1] "Biobase"

$cohort2
[1] "ExpressionSet"
attr(,"package")
[1] "Biobase"

5.1.3 Inspect the ExpressionSet

Feature names can be returned by:

fn <- ds.featureNames("eSet")
lapply(fn, head)
$cohort1
[1] "cg00000029" "cg00000108" "cg00000109" "cg00000165" "cg00000236"
[6] "cg00000289"

$cohort2
[1] "cg00000029" "cg00000108" "cg00000109" "cg00000165" "cg00000236"
[6] "cg00000289"

Experimental phenotype variables can be obtained by:

fn <- ds.varLabels("eSet")
lapply(fn, head)
$cohort1
[1] "title"            "geo_accession"    "status"           "submission_date" 
[5] "last_update_date" "type"            

$cohort2
[1] "title"            "geo_accession"    "status"           "submission_date" 
[5] "last_update_date" "type"            

The columns of the annotation can be obtained by:

fn <- ds.fvarLabels("eSet")
lapply(fn, head)
$cohort1
[1] "ID"               "Name"             "AddressA_ID"      "AlleleA_ProbeSeq"
[5] "AddressB_ID"      "AlleleB_ProbeSeq"

$cohort2
[1] "ID"               "Name"             "AddressA_ID"      "AlleleA_ProbeSeq"
[5] "AddressB_ID"      "AlleleB_ProbeSeq"

5.2 Single CpG pooled analysis

Once the methylation data have been loaded into the Opal servers, we can perform different types of analyses using functions from the dsOmicsClient package. Let us start by illustrating how to analyze a single CpG from the two cohorts, using an approach that is mathematically equivalent to pooling all the individual-level data.

ans <- ds.lmFeature(feature = "cg07363416", 
                    model = ~ diagnosis + Sex, 
                    Set = "eSet",
                    datasources = conns)
ans
             Estimate Std. Error   p-value
cg07363416 0.03459886 0.02504291 0.1670998
attr(,"class")
[1] "dsLmFeature" "matrix"      "array"      

5.3 Multiple CpGs pooled analysis

The same analysis can be performed for multiple features. This process can be parallelized using the mclapply function from the parallel package (only works on GNU/Linux, not on Windows).

ans <- ds.lmFeature(feature = c("cg00000029", "cg00000108", "cg00000109", "cg00000165"),
                    model = ~ diagnosis + Sex, 
                    Set = "eSet",
                    datasources = conns,
                    mc.cores = 20)

If the feature argument is not supplied, all features will be analyzed. Please note that this process can be extremely slow when there is a huge number of features; in the case we are illustrating there are over 400K features, so the process would take too much time.

If all the features are to be studied, we recommend switching to meta-analysis methods, described in the next section.

5.4 Full ExpressionSet meta-analysis

We can adopt another strategy: running a glm for each feature independently at each study using the limma package (which is really fast) and then combining the results (i.e., a meta-analysis approach).

ans.limma <- ds.limma(model = ~ diagnosis + Sex,
                      Set = "eSet", 
                      datasources = conns)

Then, we can visualize the top genes at each study (i.e., server) by:

lapply(ans.limma, head)
$cohort1
# A tibble: 6 x 7
  id             n    beta      SE     t       P.Value adj.P.Val
  <chr>      <int>   <dbl>   <dbl> <dbl>         <dbl>     <dbl>
1 cg13138089   100 -0.147  0.0122  -6.62 0.00000000190  0.000466
2 cg23859635   100 -0.0569 0.00520 -6.58 0.00000000232  0.000466
3 cg13772815   100 -0.0820 0.0135  -6.50 0.00000000327  0.000466
4 cg12706938   100 -0.0519 0.00872 -6.45 0.00000000425  0.000466
5 cg24724506   100 -0.0452 0.00775 -6.39 0.00000000547  0.000466
6 cg02812891   100 -0.125  0.0163  -6.33 0.00000000731  0.000466

$cohort2
# A tibble: 6 x 7
  id             n    beta      SE     t      P.Value adj.P.Val
  <chr>      <int>   <dbl>   <dbl> <dbl>        <dbl>     <dbl>
1 cg04046629    90 -0.101  0.0128  -5.91 0.0000000621    0.0172
2 cg07664323    90 -0.0431 0.00390 -5.85 0.0000000822    0.0172
3 cg27098804    90 -0.0688 0.0147  -5.79 0.000000107     0.0172
4 cg08933615    90 -0.0461 0.00791 -5.55 0.000000298     0.0360
5 cg18349298    90 -0.0491 0.00848 -5.42 0.000000507     0.0489
6 cg02182795    90 -0.0199 0.0155  -5.36 0.000000670     0.0538

The annotation can be added using the argument annotCols. It should be a vector with the columns of the annotation available in the ExpressionSet or RangedSummarizedExperiment that are to be shown. To list the available annotation columns, revisit Section 5.1.3.

ans.limma.annot <- ds.limma(model = ~ diagnosis + Sex,
                            Set = "eSet", 
                            annotCols = c("CHR", "UCSC_RefGene_Name"),
                            datasources = conns)
lapply(ans.limma.annot, head)
$cohort1
# A tibble: 6 x 9
  id          n    beta      SE     t   P.Value adj.P.Val CHR   UCSC_RefGene_Na~
  <chr>   <int>   <dbl>   <dbl> <dbl>     <dbl>     <dbl> <chr> <chr>           
1 cg1313~   100 -0.147  0.0122  -6.62   1.90e-9  0.000466 2     "ECEL1P2"       
2 cg2385~   100 -0.0569 0.00520 -6.58   2.32e-9  0.000466 2     "MTA3"          
3 cg1377~   100 -0.0820 0.0135  -6.50   3.27e-9  0.000466 17    ""              
4 cg1270~   100 -0.0519 0.00872 -6.45   4.25e-9  0.000466 19    "MEX3D"         
5 cg2472~   100 -0.0452 0.00775 -6.39   5.47e-9  0.000466 19    "ISOC2;ISOC2;IS~
6 cg0281~   100 -0.125  0.0163  -6.33   7.31e-9  0.000466 2     "ECEL1P2"       

$cohort2
# A tibble: 6 x 9
  id          n    beta      SE     t   P.Value adj.P.Val CHR   UCSC_RefGene_Na~
  <chr>   <int>   <dbl>   <dbl> <dbl>     <dbl>     <dbl> <chr> <chr>           
1 cg0404~    90 -0.101  0.0128  -5.91   6.21e-8    0.0172 11    "CD6"           
2 cg0766~    90 -0.0431 0.00390 -5.85   8.22e-8    0.0172 6     "MUC21"         
3 cg2709~    90 -0.0688 0.0147  -5.79   1.07e-7    0.0172 11    "CD6"           
4 cg0893~    90 -0.0461 0.00791 -5.55   2.98e-7    0.0360 1     ""              
5 cg1834~    90 -0.0491 0.00848 -5.42   5.07e-7    0.0489 3     "RARRES1;RARRES~
6 cg0218~    90 -0.0199 0.0155  -5.36   6.70e-7    0.0538 8     ""              

Then, the last step is to meta-analyze the results. Different methods can be used to this end; we have implemented a method that meta-analyzes the p-values of each study as follows:

ans.meta <- metaPvalues(ans.limma)
ans.meta
# A tibble: 481,868 x 4
   id               cohort1     cohort2   p.meta
   <chr>              <dbl>       <dbl>    <dbl>
 1 cg13138089 0.00000000190 0.00000763  4.78e-13
 2 cg25317941 0.0000000179  0.00000196  1.12e-12
 3 cg02812891 0.00000000731 0.00000707  1.63e-12
 4 cg12706938 0.00000000425 0.0000161   2.14e-12
 5 cg16026647 0.000000101   0.000000797 2.51e-12
 6 cg12695465 0.00000000985 0.0000144   4.33e-12
 7 cg21171625 0.000000146   0.00000225  9.78e-12
 8 cg13772815 0.00000000327 0.000122    1.18e-11
 9 cg00228891 0.000000166   0.00000283  1.38e-11
10 cg21488617 0.0000000186  0.0000299   1.62e-11
# ... with 481,858 more rows
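
The combined values appear consistent with Fisher's method; the following minimal sketch (an assumption about the exact implementation of metaPvalues, shown for intuition only) reproduces the first row of the table:

# Fisher's method: -2 * sum(log(p_i)) follows a chi-squared distribution
# with 2k degrees of freedom under the null, where k is the number of studies
fisher_combine <- function(p) {
  pchisq(-2 * sum(log(p)), df = 2 * length(p), lower.tail = FALSE)
}
fisher_combine(c(1.90e-9, 7.63e-6))  # ~4.78e-13, matching cg13138089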

We can verify that the results are pretty similar to those obtained using the pooled analysis. Here we compute the association for the top two CpGs:

res <- ds.lmFeature(feature = ans.meta$id[1:2], 
                     model = ~ diagnosis + Sex, 
                     Set = "eSet",
                     datasources = conns)
res
              Estimate  Std. Error      p-value        p.adj
cg25317941 -0.03452376 0.004303781 1.042679e-15 1.057482e-15
cg13138089 -0.13733479 0.017124046 1.057482e-15 1.057482e-15
attr(,"class")
[1] "dsLmFeature" "matrix"      "array"      

We can create a QQ-plot by using the function qqplot available in our package.

qqplot(ans.meta$p.meta)

In some cases inflation can be observed; if so, correction for cell-type composition or surrogate variables must be performed. We describe how to do that in the next two sections.

5.5 Full ExpressionSet meta-analysis adjusting for surrogate variables

The vast majority of omic studies require controlling for unwanted variability. Surrogate variable analysis (SVA) can address this issue by estimating hidden covariates that capture differences across individuals due to artifacts such as batch effects or sample quality, among others. The method is implemented in the SVA package.

Performing this type of analysis with the ds.lmFeature function is not possible, since estimating the surrogate variables would require a non-disclosive method that computes them across the different servers. This will be a future topic of dsOmicsClient. Estimating the surrogate variables separately at each server would not be a good idea, since the aim of SVA is to capture differences, mainly due to experimental issues, among ALL individuals. What we can do instead is use the ds.limma function to perform the analyses adjusted for SVA at each study.

ans.sva <- ds.limma(model = ~ diagnosis + Sex, 
                    Set = "eSet",
                    sva = TRUE, annotCols = c("CHR", "UCSC_RefGene_Name"))
ans.sva
$cohort1
# A tibble: 481,868 x 9
   id          n    beta      SE     t  P.Value adj.P.Val CHR   UCSC_RefGene_Na~
   <chr>   <int>   <dbl>   <dbl> <dbl>    <dbl>     <dbl> <chr> <chr>           
 1 cg1313~   100 -0.147  0.0122  -6.62  1.90e-9  0.000466 2     "ECEL1P2"       
 2 cg2385~   100 -0.0569 0.00520 -6.58  2.32e-9  0.000466 2     "MTA3"          
 3 cg1377~   100 -0.0820 0.0135  -6.50  3.27e-9  0.000466 17    ""              
 4 cg1270~   100 -0.0519 0.00872 -6.45  4.25e-9  0.000466 19    "MEX3D"         
 5 cg2472~   100 -0.0452 0.00775 -6.39  5.47e-9  0.000466 19    "ISOC2;ISOC2;IS~
 6 cg0281~   100 -0.125  0.0163  -6.33  7.31e-9  0.000466 2     "ECEL1P2"       
 7 cg2766~   100 -0.0588 0.0198  -6.33  7.48e-9  0.000466 16    "ANKRD11"       
 8 cg1537~   100 -0.0709 0.0115  -6.32  7.83e-9  0.000466 2     "LPIN1"         
 9 cg1552~   100 -0.0446 0.00750 -6.29  8.69e-9  0.000466 10    ""              
10 cg1269~   100 -0.0497 0.00155 -6.27  9.85e-9  0.000475 6     "GCNT2;GCNT2;GC~
# ... with 481,858 more rows

$cohort2
# A tibble: 481,868 x 9
   id         n    beta      SE     t P.Value adj.P.Val CHR   UCSC_RefGene_Name 
   <chr>  <int>   <dbl>   <dbl> <dbl>   <dbl>     <dbl> <chr> <chr>             
 1 cg040~    90 -0.101  0.0128  -5.91 6.21e-8    0.0172 11    "CD6"             
 2 cg076~    90 -0.0431 0.00390 -5.85 8.22e-8    0.0172 6     "MUC21"           
 3 cg270~    90 -0.0688 0.0147  -5.79 1.07e-7    0.0172 11    "CD6"             
 4 cg089~    90 -0.0461 0.00791 -5.55 2.98e-7    0.0360 1     ""                
 5 cg183~    90 -0.0491 0.00848 -5.42 5.07e-7    0.0489 3     "RARRES1;RARRES1" 
 6 cg021~    90 -0.0199 0.0155  -5.36 6.70e-7    0.0538 8     ""                
 7 cg160~    90 -0.0531 0.0196  -5.31 7.97e-7    0.0548 17    "MEIS3P1"         
 8 cg012~    90 -0.0537 0.00971 -5.18 1.39e-6    0.0582 7     "HOXA2"           
 9 cg251~    90 -0.0224 0.00736 -5.15 1.57e-6    0.0582 3     "ZBTB20;ZBTB20;ZB~
10 cg074~    90 -0.0475 0.00166 -5.13 1.67e-6    0.0582 22    "C22orf27"        
# ... with 481,858 more rows

attr(,"class")
[1] "dsLimma" "list"   

Then, the results can be combined by meta-analyzing them as follows:

ans.meta.sv <- metaPvalues(ans.sva)
ans.meta.sv
# A tibble: 481,868 x 4
   id               cohort1     cohort2   p.meta
   <chr>              <dbl>       <dbl>    <dbl>
 1 cg13138089 0.00000000190 0.00000763  4.78e-13
 2 cg25317941 0.0000000179  0.00000196  1.12e-12
 3 cg02812891 0.00000000731 0.00000707  1.63e-12
 4 cg12706938 0.00000000425 0.0000161   2.14e-12
 5 cg16026647 0.000000101   0.000000797 2.51e-12
 6 cg12695465 0.00000000985 0.0000144   4.33e-12
 7 cg21171625 0.000000146   0.00000225  9.78e-12
 8 cg13772815 0.00000000327 0.000122    1.18e-11
 9 cg00228891 0.000000166   0.00000283  1.38e-11
10 cg21488617 0.0000000186  0.0000299   1.62e-11
# ... with 481,858 more rows

And we can revisit the qqplot:

qqplot(ans.meta.sv$p.meta)

The DataSHIELD session must be closed by:

datashield.logout(conns)
Escott-Price, Valentina, Amanda J Myers, Matt Huentelman, and John Hardy. 2017. “Polygenic Risk Score Analysis of Pathologically Confirmed Alzheimer Disease.” Annals of Neurology 82 (2): 311–14.
Forgetta, Vincenzo, Julyan Keller-Baruch, Marie Forest, Audrey Durand, Sahir Bhatnagar, John P Kemp, Maria Nethander, et al. 2020. “Development of a Polygenic Risk Score to Improve Screening for Fracture Risk: A Genetic Risk Prediction Study.” PLoS Medicine 17 (7): e1003152.
Kramer, Iris, Maartje J Hooning, Nasim Mavaddat, Michael Hauptmann, Renske Keeman, Ewout W Steyerberg, Daniele Giardiello, et al. 2020. “Breast Cancer Polygenic Risk Score and Contralateral Breast Cancer Risk.” The American Journal of Human Genetics 107 (5): 837–48.
Pirinen, Matti. 2020. “GWAS 9: Meta-Analysis and Summary Statistics.” https://www.mv.helsinki.fi/home/mjxpirin/GWAS_course/material/GWAS9.html.
Torkamani, Ali, Nathan E Wineinger, and Eric J Topol. 2018. “The Personal and Clinical Utility of Polygenic Risk Scores.” Nature Reviews Genetics 19 (9): 581–90.
Winkler, Thomas W, Felix R Day, Damien C Croteau-Chonka, Andrew R Wood, Adam E Locke, Reedik Mägi, Teresa Ferreira, et al. 2014. “Quality Control and Conduct of Genome-Wide Association Meta-Analyses.” Nature Protocols 9 (5): 1192–212.

  1. The same methodology and code can be used with unitary VCF resources that contain the variant information of all chromosomes.↩︎

  2. The same methodology and code can be used with unitary VCF resources that contain the variant information of all chromosomes.↩︎

  3. It’s important to note that any encoding present in the case-control column other than the case and control labels will be turned into missing values. As an example, if we have "cancer"/"no cancer"/"no information"/"prefer not to answer", we can specify case="cancer", control="no cancer", and all the individuals with "no information"/"prefer not to answer" will be interpreted as missings.↩︎

  4. The same methodology and code can be used with unitary VCF resources that contain the variant information of all chromosomes.↩︎